Ameliorative missing value imputation for robust biological knowledge inference
Abstract
Gene expression data is widely used in various post-genomic analyses. The data is often obtained using microarrays because they can measure the expression of thousands of genes simultaneously. The expression data, however, contains significant numbers of missing values, which can affect subsequent biological analysis. To minimize the impact of these missing values, several imputation algorithms have been proposed, including Collateral Missing Value Estimation (CMVE), Bayesian Principal Component Analysis (BPCA), Least Square Impute (LSImpute), Local Least Square Impute (LLSImpute), and K-Nearest Neighbour (KNN). These algorithms, however, exploit only the global or only the local correlation structure of the data, which can lead to higher estimation errors. This paper presents an Ameliorative Missing Value Imputation (AMVI) technique that can exploit global/local and positive/negative correlations in a given dataset by automatically selecting the optimal number of predictor genes k using a non-parametric wrapper method based on Monte Carlo simulations. AMVI has the CMVE strategy at its core because CMVE, which exploits positive/negative correlations, has demonstrated improved performance compared with both low-variance methods such as BPCA and LLSImpute and high-variance methods such as KNN and ZeroImpute. The performance of AMVI is compared with CMVE, BPCA, LLSImpute, and KNN by randomly removing between 1% and 15% of the values in eight different ovarian cancer, breast cancer and yeast datasets. Together with the standard NRMS error metric, the True Positive (TP) rate of significant gene selection, the biological significance of the selected genes and the statistical significance test results are presented to investigate the impact of missing values on subsequent biological analysis.
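The evaluation protocol described above (hide a known fraction of entries, impute them, score the result) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the NRMS error, so the sketch assumes one common form, the Frobenius norm of the difference divided by the Frobenius norm of the original matrix. The row-mean imputer is a hypothetical placeholder and is not one of the algorithms named in the abstract; all function names are invented for this example.

```python
import numpy as np

def remove_random_entries(data, rate, rng):
    """Return a copy of `data` (genes x samples) with a fraction `rate` of entries set to NaN."""
    masked = data.astype(float).copy()
    n_missing = int(round(rate * data.size))
    flat_idx = rng.choice(data.size, size=n_missing, replace=False)
    masked.flat[flat_idx] = np.nan
    return masked

def nrms_error(original, imputed):
    """Assumed NRMS definition: ||original - imputed||_F / ||original||_F."""
    return np.linalg.norm(original - imputed) / np.linalg.norm(original)

def row_mean_impute(masked):
    """Placeholder imputer (not from the paper): fill gaps with each gene's observed mean."""
    filled = masked.copy()
    row_means = np.nanmean(filled, axis=1)
    # Fall back to the global observed mean if a gene has no observed values.
    row_means = np.where(np.isnan(row_means), np.nanmean(filled), row_means)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = row_means[rows]
    return filled

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expression = rng.normal(size=(200, 20))  # synthetic stand-in for a real dataset
    for rate in (0.01, 0.05, 0.15):          # missing rates within the 1%-15% range
        masked = remove_random_entries(expression, rate, rng)
        print(rate, nrms_error(expression, row_mean_impute(masked)))
```

Any imputer with the same matrix-in, matrix-out shape could be swapped in for `row_mean_impute` to compare methods at each missing rate.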
The enhanced performance of AMVI was demonstrated by its lower NRMS error, improved TP rate, the biological significance of the selected genes and the statistical significance test results, when compared with the aforementioned imputation methods across all the datasets. The results show that AMVI adapted to the latent correlation structure of the data and proved to be an effective and robust approach compared with the trial-and-error methodology for selecting k. The results confirmed that AMVI can be successfully applied to accurately impute missing values prior to any microarray data analysis.
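To make the idea of choosing k by simulation rather than by trial and error more concrete, here is a minimal sketch of a Monte Carlo wrapper. It is not the AMVI procedure, whose details the abstract does not give. As assumptions for illustration, it uses a simple Euclidean KNN imputer in place of CMVE, hides a random 5% of the observed entries in each trial, and scores each candidate k with a normalized RMS error on the hidden entries. The names `knn_impute` and `select_k` and all parameter defaults are hypothetical.

```python
import numpy as np

def knn_impute(masked, k):
    """Simple KNN stand-in (not CMVE): average the k nearest genes that observe the column."""
    filled = masked.copy()
    observed = ~np.isnan(masked)
    n_genes = masked.shape[0]
    for i in np.where(~observed.all(axis=1))[0]:
        # Distance from gene i to every other gene over the columns both observe.
        dists = np.full(n_genes, np.inf)
        for j in range(n_genes):
            shared = observed[i] & observed[j]
            if j != i and shared.any():
                dists[j] = np.sqrt(np.mean((masked[i, shared] - masked[j, shared]) ** 2))
        order = np.argsort(dists)
        for c in np.where(~observed[i])[0]:
            donors = [j for j in order if observed[j, c] and np.isfinite(dists[j])][:k]
            # Fall back to the column mean if no usable neighbour exists.
            filled[i, c] = masked[donors, c].mean() if donors else np.nanmean(masked[:, c])
    return filled

def select_k(incomplete, candidates, n_trials=10, hide_rate=0.05, seed=0):
    """Pick the candidate k with the lowest mean error over repeated random hide-and-impute trials."""
    rng = np.random.default_rng(seed)
    obs_idx = np.flatnonzero(~np.isnan(incomplete))
    n_hide = max(1, int(round(hide_rate * obs_idx.size)))
    # Draw the hidden sets once so every k is scored on the same trials.
    trials = [rng.choice(obs_idx, size=n_hide, replace=False) for _ in range(n_trials)]
    mean_err = {}
    for k in candidates:
        errs = []
        for hidden in trials:
            rows, cols = np.unravel_index(hidden, incomplete.shape)
            trial = incomplete.copy()
            trial[rows, cols] = np.nan
            est = knn_impute(trial, k)
            truth = incomplete[rows, cols]
            errs.append(np.linalg.norm(est[rows, cols] - truth) / np.linalg.norm(truth))
        mean_err[k] = float(np.mean(errs))
    return min(mean_err, key=mean_err.get), mean_err
```

Reusing the same hidden sets for every candidate keeps the comparison between values of k from being dominated by which entries happened to be hidden.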
Related articles
How to Improve Postgenomic Knowledge Discovery Using Imputation
While microarrays make it feasible to rapidly investigate many complex biological problems, their multistep fabrication has the proclivity for error at every stage. The standard tactic has been to either ignore or regard erroneous gene readings as missing values, though this assumption can exert a major influence upon postgenomic knowledge discovery methods like gene selection and gene regulato...
Influence of Pattern of Missing Data on Performance of Imputation Methods: An Example from National Data on Drug Injection in Prisons
Background Policy makers need models to be able to detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets. Presence of missing data challenges the practice of model development. Several studies suggested that performance of imputation methods is acceptable when missing rate is moderate. One of the issues which was of less concern...
Missing data imputation in multivariable time series data
Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Multivariate time series missing data imputation is a challenging topic and needs to be carefully considered before learning or predicting time series. Frequent researches have been done on the use of diffe...
ST-14 Handling Missing Data with Multiple Imputation Using PROC MI in SAS
Multiple imputation was developed as a general method for inference with missing data. Instead of replacing the missing observation with a single value, the multiple imputation method replaces each missing value with multiple plausible values. PROC MI in SAS creates multiply imputed data sets for incomplete multivariate data. This study reviews multiple imputation as an analytic strategy for missi...
Heuristic Non Parametric Collateral Missing Value Imputation: A Step Towards Robust Post-genomic Knowledge Discovery
Microarrays are able to measure the patterns of expression of thousands of genes in a genome to give profiles that facilitate much faster analysis of biological processes for diagnosis, prognosis and tailored drug discovery. Microarrays, however, commonly have missing values which can result in erroneous downstream analysis. To impute these missing values, various algorithms have been proposed ...
Journal title:
- Journal of Biomedical Informatics
Volume 41, Issue 4
Pages: -
Publication date: 2008